Dataset:
The Amazon Fine Food Reviews dataset was downloaded from Kaggle and placed in my Google Drive for programmatic retrieval. It consists of reviews of fine foods from Amazon. The data span a period of more than 10 years and include all ~568,000 reviews. Reviews include product and user information, ratings, and a plain-text review. The dataset also includes reviews from all other Amazon categories. You can read more about the data at SNAP.
Approach:
Often in business, as in academia, we don't have the luxury of proper data pipelines that settle the scope and value of data beforehand; instead we derive value from already-collected data that is sometimes good but often messy. That was the case for this analysis, which is why you'll find the goal of the project in section 3.
An analysis is a sequential process involving a lot of back-and-forth questioning. I've tried to capture this natural question-and-answer process by documenting each question and its answer, either through text or through Python commands.
This approach serves at least two purposes. First, it gives me a way to examine my own thinking: what kinds of questions I'm asking and how I approach a problem. Second, when we read someone else's analysis we often find it hard to decipher some of their actions; here the question precedes the action, so each step explains itself. I've tried to keep the questions as natural as possible.
Four or five questions into a section, you may feel I could have skipped questions 1, 2 & 3 and gone straight to questions 4 & 5. But when we do analysis it seldom happens that we reach the goal directly; it is usually preceded by missteps or extra steps, and I am attempting to document those missteps and extra steps.
I would love for you to read the analysis in its entirety, but you may feel it is too lengthy and that you don't have enough time to go through all of it. If that's the case, I strongly suggest you at least read the summary (section 9.1).
Key Results:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import re
from gensim.models import Word2Vec, KeyedVectors
import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, LSTM, Embedding
from tensorflow.keras.callbacks import TensorBoard
import time
from random import sample
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
# Connecting to Google Drive
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("drive/My Drive/colab_files/Amazon_Reviews/Reviews.csv")
df.head()
2.1 Question: How many unique values ?
df.nunique()
2.2 Question: Are there any null values ?
df.isna().sum()
2.3 Question: How are the scores distributed ?
From the plots below, the dataset is heavily skewed towards positive ratings; this has to be kept in mind before any assertions are drawn.
def GetRelativeFrequency(df):
    series = df.Score.value_counts() / len(df)
    indx = np.sort(df.Score.unique())
    plt.bar(indx, series * 100)
    plt.ylabel('%')
    plt.xlabel('Ratings')
    plt.title('Relative frequency')
    plt.xticks(indx, series.index)
    # Annotate each bar with its percentage
    for i, v in enumerate(series * 100):
        plt.text(indx[i] - 0.25, v + 0.01, str(round(v, 2)))

GetRelativeFrequency(df)
Given that the dataset has very limited features, the scope of the analysis is severely restricted. More details about the users (such as their demographics) would have opened doors for all kinds of analysis, but the textual data present can still be used to develop a sentiment detector or classifier.
I'd like to build a classifier that uses the whole review to classify sentiment broadly into 2 categories: Positive (4, 5) and Negative (1, 2, 3). Afterwards, ideally, I'd like to test it on food reviews on Twitter, where no ratings accompany the text, but let's condition that on the availability of data. Further, if time permits, I'd like to add more granularity to the classification task, though I believe this would needlessly shrink the generality of the classifier and thereby hurt its practicality.
Further, I'd like to analyze in general what types of phrases or words make a review positive or negative, and check whether this reconciles with our understanding of language.
Question 4.1: Is the data balanced ? If not, do I want to artificially balance it ?
No, the data is severely imbalanced towards positive reviews. For ideal results, and to give the model an honest chance to learn, the input classes should be appropriately weighted; otherwise the model would find it easy and beneficial to classify everything as positive.
Question 4.2: What approach do I want to take to balance the data ?
For now, let's proceed with a weighted loss function, where I'll weight the loss such that the classifier treats both sentiments with equal importance.
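The weights can be derived from the label frequencies themselves. Below is a minimal sketch (with made-up labels, not the actual dataset) of the inverse-frequency scheme this idea is based on:

```python
import numpy as np

# Made-up labels: 1 = positive, 0 = negative, imbalanced towards positive
labels = np.array([1, 1, 1, 1, 0])

# Inverse-frequency weights: scale each class by total / (n_classes * count),
# so the rarer class contributes more to the loss
counts = np.bincount(labels)
weights = {cls: len(labels) / (len(counts) * counts[cls])
           for cls in range(len(counts))}
print(weights)  # {0: 2.5, 1: 0.625}
```

The `{0: 4, 1: 1}` dictionary passed to `model.fit` later in this notebook is a hand-rounded weighting in the same spirit.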
Question 4.3: Does the data need to be prepared or preprocessed before it is fed to the model ?
Yes.
# Cleaning the Text column by lowercasing the text and removing all HTML tags and punctuation
def CleanText(series):
    series = series.str.lower()
    # Removing HTML tags
    series = series.apply(lambda x: re.compile(r'<[^>]+>').sub('', str(x)))
    # Removing punctuation (everything except letters and spaces)
    series = series.str.replace('[^a-zA-Z ]', '', regex=True)
    return series

df.Text = CleanText(df.Text)
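As a quick sanity check, here are the same cleaning steps applied to a made-up review snippet (the text is hypothetical, not from the dataset):

```python
import re
import pandas as pd

# Made-up snippet run through the same steps as CleanText above
sample = pd.Series(["<br />Great TASTE, 10/10 would buy again!"])
sample = sample.str.lower()                                              # lowercase
sample = sample.apply(lambda x: re.compile(r'<[^>]+>').sub('', str(x)))  # strip HTML tags
sample = sample.str.replace('[^a-zA-Z ]', '', regex=True)                # strip punctuation/digits
print(sample[0])  # great taste  would buy again
```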
Question 4.3.1: Are the texts cleaned correctly ?
df.Text[55555]
Question 4.3.2: Why are all the words from all of the reviews put into a list ?
Because Gensim's Word2Vec function expects a list of tokenized sentences, i.e. a list of lists of words.
# Creating vocabulary & word embeddings using Gensim Word2Vec
words = [i.split() for i in df.Text.values]
Question 4.3.3: Why do we need word embeddings?
Word embeddings are word vectors with n dimensions, i.e. mappings of all unique words into an n-dimensional space. Using this mapping we can find similarity and dissimilarity between words across the dimensions. Word embeddings add much-needed context to each word and are therefore immensely useful for Natural Language Processing tasks. Without word embeddings, all the vectors are one-hot encoded and thus contain no information about the relations between words.
w2v = Word2Vec(words, min_count =1, size=300)
Question 4.3.4: What does the Word2Vec object look like ?
",".join(w2v.wv.__dict__.keys())
Question 4.3.5: Are the dimensions of the word embeddings as expected (i.e = 300) ?
w2v.wv.vectors.shape
Question 4.3.6: Are the index & words mapped correctly to each other ?
print(w2v.wv.index2word[542])
print(w2v.wv.vocab['cannot'].index)
Question 4.3.7: What does a weight look like ?
print(w2v.wv.vectors[10,:].shape)
w2v.wv.vectors[10,:]
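What makes these vectors useful is the geometry: similar words end up close under cosine similarity, which is the measure behind Gensim's `most_similar`. A minimal illustration with two made-up 4-dimensional vectors (not the trained 300-dimensional ones):

```python
import numpy as np

# Two hypothetical word vectors; cosine similarity close to 1 means "similar"
v_coffee = np.array([0.9, 0.1, 0.3, 0.2])
v_tea = np.array([0.8, 0.2, 0.4, 0.1])

cos_sim = v_coffee @ v_tea / (np.linalg.norm(v_coffee) * np.linalg.norm(v_tea))
print(round(cos_sim, 3))  # 0.979
```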
Question 4.3.8: What else needs to be done ?
The texts need to be converted to sequences of word indices.
seq_len = 100  # mandatory sequence length
unique_words_len = w2v.wv.vectors.shape[0]

# Function to pad reviews with fewer than 100 words
def PadSequence(sequence):
    pad_by = seq_len - len(sequence)
    for i in range(pad_by):
        sequence.append([unique_words_len])
    return sequence

# Function to get network-feedable data
def GetTrainingData(df):
    X = []
    Y = []
    for row in df.values:
        Y.append(1 if row[-4] > 3 else 0)  # Score > 3 -> positive label
        sequence = [[w2v.wv.vocab[x].index] for x in row[-1].split()[:seq_len]]
        X.append(PadSequence(sequence))
    return X, Y

# Get data & labels
X, Y = GetTrainingData(df)

# Converting data and labels into feedable format
X = np.array(X).reshape(len(X), seq_len)
Y = np.array(Y)
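The pad-or-truncate logic above can be illustrated on a toy sequence (the length of 5 and pad index of 1000 below are made up for readability; the real code uses 100 and the vocabulary size):

```python
# Toy version of the pad/truncate step used in PadSequence/GetTrainingData
toy_len = 5       # assumed tiny sequence length for illustration
pad_index = 1000  # hypothetical vocabulary size, used as the padding index

def pad_or_truncate(indices):
    indices = indices[:toy_len]                        # truncate long reviews
    indices += [pad_index] * (toy_len - len(indices))  # pad short reviews
    return indices

print(pad_or_truncate([7, 42, 3]))          # [7, 42, 3, 1000, 1000]
print(pad_or_truncate([1, 2, 3, 4, 5, 6]))  # [1, 2, 3, 4, 5]
```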
Question 4.3.9: Why are the word embeddings manipulated in the cell below ?
The word embedding matrix originally had rows = number of unique words and columns = embedding size. An extra row of zeros has been added to account for the padded sequences.
vocab = w2v.wv.vocab
index2word = w2v.wv.index2word
vectors = w2v.wv.vectors
word_emb = np.zeros(( vectors.shape[0]+1, vectors.shape[1]))
word_emb[:vectors.shape[0],:] = vectors
# Saving the data so that I don't have to run through the above steps again
# when the runtime (or environment) gets disconnected
np.save("drive/My Drive/colab_files/Amazon_Reviews/X.npy", X) # data
np.save("drive/My Drive/colab_files/Amazon_Reviews/Y.npy", Y) # labels
np.save("drive/My Drive/colab_files/Amazon_Reviews/emb.npy",w2v.wv.vectors) # word embeddings
# Loading the data from drive
X = np.load("drive/My Drive/colab_files/Amazon_Reviews/X.npy") # data
Y = np.load("drive/My Drive/colab_files/Amazon_Reviews/Y.npy") # labels
vectors = np.load("drive/My Drive/colab_files/Amazon_Reviews/emb.npy") # word embeddings
Question 4.3.10: What does the last row look like ?
word_emb[-1,:]
Question 4.3.11: What does the second-to-last row look like ?
word_emb[-2,:]
Question 4.3.12: Is the shape of the manipulated word embedding matrix correct ?
word_emb.shape
Question 4.3.13: Are the input & output ready to be fed in the model ?
Yes.
Question 4.3.14: What do the input & output shapes look like ?
print(f"Data Shape: {X.shape}")
print(f"Labels Shape: {Y.shape}")
Question 5.1: What type of classifier should I use, traditional ones or deep learning models ?
A deep learning model. Since the number of records is very high, a deep neural network is likely to perform much better than traditional models.
Question 5.2: What type of deep learning model is suitable for this task ? A plain neural network, a convolutional neural network or a recurrent neural network ?
This is a no-brainer: since the task involves sequence data, an RNN is likely to perform much better.
Question 5.3: How many layers do you want to use ?
I don't know. I'll start with a basic model and iteratively change layers and other hyperparameters as needed.
def GetModel(shape):
    return Sequential([
        # Frozen embedding layer initialised with the Word2Vec weights
        Embedding(word_emb.shape[0], word_emb.shape[1], input_length=seq_len,
                  weights=[word_emb], trainable=False),
        LSTM(128, return_sequences=True, input_shape=shape),
        Dropout(0.2),
        BatchNormalization(),
        LSTM(128, return_sequences=True),
        Dropout(0.1),
        BatchNormalization(),
        LSTM(128),
        Dropout(0.2),
        BatchNormalization(),
        Dense(1, activation='sigmoid')
    ])

model = GetModel(X.shape[1:])
opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)
# Compile model
model.compile(
    loss="binary_crossentropy",
    optimizer=opt,
    metrics=["accuracy"]
)
NAME = f"Model-{int(time.time())}"
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="logs/{}".format(NAME))
class_weights = {0:4,1:1} # Class Weights to address data imbalance issue
model.fit(
    X, Y,
    batch_size=64,
    epochs=5,
    validation_split=0.02,
    callbacks=[tensorboard],
    class_weight=class_weights
)
Question 5.4: Am I happy with the model ?
Overall, ~90% accuracy on the train and validation sets is quite good, if not great. But I'll reserve judgement until I analyze the model's performance further.
Question 5.5: Why are you saving the model ?
With Google Colab you can access a GPU only for a limited period, and since it's a pain to train an RNN on a CPU, it is better to save the model and load it again after the runtime gets disconnected to analyse its performance.
model.save("drive/My Drive/colab_files/Amazon_Reviews/model.h5")
model = load_model("drive/My Drive/colab_files/Amazon_Reviews/model.h5")
Question 5.6: What next ?
First I get predictions using the model.predict method, then I reshape the (rows, 1) array into a vector matching the shape of the label vector, and then I save the predictions to Drive so that I don't have to re-predict 568k samples each time the runtime is disconnected.
pred = model.predict(X)
pred = np.reshape(pred, Y.shape)
np.save("drive/My Drive/colab_files/Amazon_Reviews/pred.npy",pred)
pred = np.load("drive/My Drive/colab_files/Amazon_Reviews/pred.npy", allow_pickle=True)
Question 5.7: What does the prediction output look like ?
It signifies the conditional probability P(Y = 1|X): the greater the probability, the greater the chance that the review is positive. If the output probability is greater than 0.5, the input review is classified as positive, and vice versa.
pred[0]
Question 5.8: Is the shape of the prediction array correct ?
pred.shape
Question 5.9: Why am I rounding off probabilities?
Because the output has discrete values (0 and 1) while the network outputs the probability P(Y = 1|X), I have to round it off to make it discrete. I use a cut-off probability of 0.5, i.e. if the predicted probability is greater than 0.5 I assign the output 1 for that review, and vice versa.
pred_round = np.round(pred)
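A side note on the rounding: `np.round` rounds halves to even, so a probability of exactly 0.5 maps to class 0. An explicit comparison states the cutoff rule directly and is equivalent everywhere except exactly at 0.5 (the probabilities below are made up):

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.7])

print(np.round(probs))            # [0. 0. 1.]  (0.5 rounds half to even -> 0)
print((probs > 0.5).astype(int))  # [0 0 1]
```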
Question 6.1: How many reviews are wrongly classified?
incorrect = Y[pred_round != Y]
incorrect.shape
false_neg = sum(incorrect) # Calculating False Negatives
false_pos = len(incorrect) - false_neg # Calculating False Positives
Question 6.2: How many False Negatives?
false_neg
Question 6.3: How many False Positives?
false_pos
Question 6.4: What is the True Positive Rate / Sensitivity / Recall?
print(f"True Positive Rate: {round(1 - (false_neg/sum(Y==1)),4) * 100}")
Question 6.5: What is the True Negative Rate / Specificity ?
print(f"True Negative Rate: {round(1 - (false_pos/sum(Y==0)),4) * 100}")
Question 6.6: Is the model biased towards one class?
Since the sensitivity and specificity of the model are quite similar, the model is not biased towards either class.
Question 6.7: Is the model over-predicting the positive class (i.e. is the precision too low)?
No.
print(f"Precision: {round(1 - (false_pos/sum(pred_round==1)),4) * 100}")
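The hand-computed counts above can be cross-checked with scikit-learn's `confusion_matrix` (assuming scikit-learn is available in the runtime; the labels below are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions to show the layout:
# rows are true classes, columns are predicted classes
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                      # 1 1 1 3
print(f"Recall:    {tp / (tp + fn):.2f}")  # 0.75
print(f"Precision: {tp / (tp + fp):.2f}")  # 0.75
```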
# Segregating review indices
incorrect_ind_neg = list(np.where((Y != pred_round) & (Y == 0))[0]) #indices of incorrectly predicted negative reviews
#print(incorrect_ind_neg)
incorrect_ind_pos = list(np.where((Y != pred_round) & (Y == 1))[0]) #indices of incorrectly predicted positive reviews
#print(incorrect_ind_pos)
correct_ind_neg = list(np.where((Y == pred_round) & (Y == 0))[0]) #indices of correctly predicted negative reviews
#print(correct_ind_neg)
correct_ind_pos = list(np.where((Y == pred_round) & (Y == 1))[0]) #indices of correctly predicted positive reviews
#print(correct_ind_pos)
df["seq_len_text"] = df.Text.apply(lambda x: (" ").join(x.split()[:seq_len])) # Setting up new column that matches with the input data to the model
df.head()
# Method to get a word cloud given a dataframe
def GetWordCloud(df):
    text = ""
    for i in df.values:
        text = f"{text} {i[0]}"
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=stopwords,
                          min_font_size=10).generate(text)
    # Plot the word cloud image
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
# Method to sample reviews given indices & the number of samples
def GetSampleReviewsWithDetails(ind, samples, col_indices=[-1, -5]):
    ind = sample(ind, samples)  # sampling from the indices provided
    for i in ind:
        print(df.iloc[i, col_indices[0]])
        print("\n")
        print(f"Rating - {df.iloc[i, col_indices[1]]} || Predicted Probability - {pred[i]}")
        print("\n")
Question 6.8: What do the correctly predicted positive reviews look like?
I cannot look at every review since their number is huge, therefore I sampled 3 of the positive reviews that were correctly predicted as positive. All 3 reviews had a 5-star rating and, going by the text, nothing seems out of place.
GetSampleReviewsWithDetails(correct_ind_pos,3)
Question 6.9: What are the most popular words in correctly predicted positive reviews?
Even though a word cloud by itself is not very useful for a sequence-learning task, it provides a broad but intuitive view of the most frequent words in a corpus (distribution), conveyed by word size. Some of the frequently occurring words and phrases, such as "highly recommended" and "tastes great", reconcile perfectly with the distribution they come from.
GetWordCloud(df.iloc[correct_ind_pos,-1].to_frame())
Question 6.10: What does the distribution of rating look like for correctly predicted positive reviews?
GetRelativeFrequency(df.iloc[correct_ind_pos,:])
Question 6.11: What do the correctly predicted negative reviews look like?
Again, looking at 3 randomly sampled negative reviews, everything is working as expected.
GetSampleReviewsWithDetails(correct_ind_neg,3)
Question 6.12: What are the most popular words in correctly predicted negative reviews?
GetWordCloud(df.iloc[correct_ind_neg,-1].to_frame())
Question 6.13: What does the distribution of rating look like for correctly predicted negative reviews?
GetRelativeFrequency(df.iloc[correct_ind_neg,:])
Question 6.14: What do the incorrectly predicted positive reviews look like?
I've sampled 10 incorrectly predicted positive reviews and provided my own labels (i.e. how I feel about the sentiment) for each of them, along with whether the model's performance is explainable or satisfactory. Review # is the order in which the reviews are output in the cell below.
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Incoherent, Mixed Sentiments | 0.18 | Satisfactory |
| 2 | 5 | Positive | Mixed Sentiments | 0.17 | Satisfactory |
| 3 | 4 | Positive | Mixed Sentiments | 0.26 | Satisfactory |
| 4 | 4 | Positive | Mixed Sentiments | 0.46 | Satisfactory |
| 5 | 4 | Positive | Suggestions | 0.31 | Satisfactory |
| 6 | 5 | Positive | Mixed Sentiments | 0.08 | Satisfactory |
| 7 | 5 | Positive | Suggestions | 0.43 | Satisfactory |
| 8 | 4 | Positive | Positive Sentiments, Suggestions | 0.32 | Satisfactory |
| 9 | 5 | Positive | Positive Sentiments | 0.18 | Not Satisfactory |
| 10 | 4 | Positive | Mixed Sentiments | 0.17 | Satisfactory |
GetSampleReviewsWithDetails(incorrect_ind_pos,10)
Question 6.15: What are the most popular words in incorrectly predicted positive reviews?
GetWordCloud(df.iloc[incorrect_ind_pos,-1].to_frame())
Question 6.16: What does the distribution of rating look like for incorrectly predicted positive reviews?
GetRelativeFrequency(df.iloc[incorrect_ind_pos,:])
Question 6.17: What do the incorrectly predicted negative reviews look like?
I've sampled 10 incorrectly predicted negative reviews and provided my own labels (i.e. how I feel about the sentiment) for each of them, along with whether the model's performance is explainable or satisfactory. Review # is the order in which the reviews are output in the cell below, where you can read the reviews.
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 3 | Negative | Negative Sentiments, Criticism | 0.86 | Not Satisfactory |
| 2 | 3 | Negative | Mixed Sentiments, Edited Review | 0.56 | Satisfactory |
| 3 | 3 | Negative | Positive Sentiments, Usage Caution | 0.77 | Satisfactory |
| 4 | 3 | Negative | Positive Sentiments | 0.69 | Satisfactory |
| 5 | 3 | Negative | Postive Sentiments | 0.77 | Satisfactory |
| 6 | 3 | Negative | Mixed Sentiments, Suggestion | 0.62 | Satisfactory |
| 7 | 3 | Negative | Positive Sentiment, Negative Prediction | 0.88 | Satisfactory |
| 8 | 3 | Negative | Positive Sentiments, Incoherent | 0.88 | Satisfactory |
| 9 | 2 | Negative | Negative Sentiments, Positive Expectations | 0.69 | Satisfactory |
| 10 | 3 | Negative | Positive Sentiments | 0.81 | Satisfactory |
GetSampleReviewsWithDetails(incorrect_ind_neg,10)
Question 6.18: What are the most popular words in incorrectly predicted negative reviews?
GetWordCloud(df.iloc[incorrect_ind_neg,-1].to_frame())
Question 6.19: What does the distribution of rating look like for incorrectly predicted negative reviews?
GetRelativeFrequency(df.iloc[incorrect_ind_neg,:])
Question 6.20: What does the distribution of predicted probabilities look like for incorrectly and correctly predicted positive reviews?
sns.distplot(pred[incorrect_ind_pos], hist=True, kde=True,
             bins=int(180/5), color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4},
             axlabel="Predicted Probabilities")
sns.distplot(pred[correct_ind_pos], hist=True, kde=True,
             bins=int(180/5), color='darkorange',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4},
             axlabel="Predicted Probabilities")
Question 6.21: What does the distribution of predicted probabilities look like for incorrectly and correctly predicted negative reviews?
sns.distplot(pred[incorrect_ind_neg], hist=True, kde=True,
             bins=int(180/5), color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4},
             axlabel="Predicted Probabilities")
sns.distplot(pred[correct_ind_neg], hist=True, kde=True,
             bins=int(180/5), color='darkorange',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4},
             axlabel="Predicted Probabilities")
Question 6.22: What does the distribution of predicted probabilities look like for different ratings?
| Rating | Comments |
|---|---|
| 1 | Expectedly approximates an exponential distribution with a sharp slope; nothing out of sorts. |
| 2 | Expectedly approximates an exponential distribution with a slightly less sharp slope; nothing out of sorts. |
| 3 | In practice these are neutral reviews, but for this analysis I set them up as negative reviews. It is likely the network shifted the neutral sentiments towards zero to help itself classify these as negative; this is reflected in a slightly less sharp drop in the slope in comparison with the predicted probabilities for ratings 1 & 2. It is likely that neutral reviews with negative undertones ended up with probabilities close to zero, that neutral reviews with mixed sentiments shifted from the 0.4 - 0.5 range to the 0.1 - 0.3 range, and that neutral reviews with positive sentiments likewise shifted from the 0.5 - 0.65 range to the 0.3 - 0.5 range. |
| 4 | The distribution of predicted probabilities for rating 4 expectedly follows a negative exponential distribution, but the slope of the curve isn't as sharp as I'd want it to be. It should be noted that around half of the wrongly predicted reviews had rating 4, whereas rating-4 reviews made up a mere ~21% of the positive-sentiment block. I cannot hypothesize about the reason until I go through a substantial chunk of rating-4 reviews with low predicted probability. |
| 5 | Expectedly approximates a negative exponential distribution with a sharp slope; nothing out of sorts. |
Overall everything seems to be working fine except for reviews with rating 4, where the number of incorrect predictions is highly disproportionate.
f, axes = plt.subplots(2, 3, figsize=(12, 12))
axes[1][2].set_axis_off()
for i, r, c in [[1, [0, 0], "red"], [2, [0, 1], "orange"], [3, [0, 2], "blue"],
                [4, [1, 0], "yellow"], [5, [1, 1], "green"]]:
    sns.distplot(pred[df.Score == i], hist=True, kde=True,
                 bins=int(180/5), color=c,
                 hist_kws={'edgecolor': 'black'},
                 kde_kws={'linewidth': 4},
                 axlabel=f"Predicted Probabilities for Rating - {i}",
                 ax=axes[r[0], r[1]])
Question 6.22.1: Do reviews with rating 4 have disproportionately high incorrect predictions ?
Yes.
df["pred_prob"] = pred
df["predicted_correctly"] = pred_round == Y
sns.countplot(x = "Score", hue = "predicted_correctly", data = df)
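The same per-rating breakdown can be computed numerically with a groupby; below is a miniature with made-up rows (the real dataframe has the same two columns after the cell above):

```python
import pandas as pd

# Made-up miniature of df[["Score", "predicted_correctly"]]
toy = pd.DataFrame({
    "Score": [5, 5, 4, 4, 4, 1],
    "predicted_correctly": [True, True, True, False, False, True],
})

# Fraction of correctly predicted reviews per rating
acc = toy.groupby("Score")["predicted_correctly"].mean()
print(acc)
```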
Question 6.22.2: What do the reviews with rating 4 and predicted probability less than 0.11 look like ?
Sampling 5 reviews from the said distribution
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Mixed Sentiments, Comparison | 0.04 (too low) | Not Satisfactory |
| 2 | 4 | Negative | Mixed Sentiments, Suggestion, Expectations | 0.06 | Satisfactory |
| 3 | 4 | Positive | Negative Sentiments | 0.04 | Satisfactory |
| 4 | 4 | Positive | Negative Sentiments, Criticism | 0.08 | Satisfactory |
| 5 | 4 | Positive | Mixed Sentiments, Negative Inclination | 0.015 | Satisfactory |
index = list(df[(df.pred_prob < 0.11) & (df.Score == 4)].index)
GetSampleReviewsWithDetails(index,5, col_indices=[-3, -7])
Question 6.22.3: How are words distributed for reviews with rating 4 and predicted probability less than 0.11 ?
As explained above, one can't read much from a word cloud, but even the higher prevalence of words such as "disappointed", "prefer", "hard" and "though" logically reconciles with the low predicted probability.
GetWordCloud(df[(df.pred_prob < 0.11) & (df.Score == 4)].seq_len_text.to_frame())
Question 6.22.4: What do the reviews with rating 4 and predicted probability between 0.11 - 0.21 look like ?
Sampling 5 reviews from the said distribution
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Negative Sentiments, Incoherent | 0.20 | Satisfactory |
| 2 | 4 | Positive | Negative Sentiments, Comparison | 0.15 | Satisfactory |
| 3 | 4 | Negative | Mixed Sentiments, Negative Inclination | 0.16 | Satisfactory |
| 4 | 4 | Positive | Mixed Sentiments, Brief | 0.20 | Satisfactory |
| 5 | 4 | Positive | Negative Sentiments | 0.13 | Satisfactory |
index = list(df[(df.pred_prob >= 0.11) & (df.pred_prob < 0.21) & (df.Score == 4)].index)
GetSampleReviewsWithDetails(index,5, col_indices=[-3, -7])
Question 6.22.5: How are words distributed for reviews with rating 4 and predicted probability between 0.11 - 0.21 ?
Nothing much to make sense of here, again showing why it would be counter-productive to assign significant value to word counts in a sequence-learning task.
GetWordCloud(df[(df.pred_prob >= 0.11) & (df.pred_prob < 0.21) & (df.Score == 4)].seq_len_text.to_frame())
Question 6.22.6: What do the reviews with rating 4 and predicted probability between 0.21 - 0.31 look like ?
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Negative Sentiments | 0.26 | Satisfactory |
| 2 | 4 | Positive | Mixed Sentiments, Caution | 0.15 | Satisfactory |
| 3 | 4 | Negative | Mixed Sentiments, Brief | 0.27 | Satisfactory |
| 4 | 4 | Positive | Mixed Sentiments, Negative Inclination | 0.27 | Satisfactory |
| 5 | 4 | Positive | Mixed Sentiments | 0.27 | Satisfactory |
index = list(df[(df.pred_prob >= 0.21) & (df.pred_prob < 0.31) & (df.Score == 4)].index)
GetSampleReviewsWithDetails(index,5, col_indices=[-3, -7])
Question 6.22.7: How are words distributed for reviews with rating 4 and predicted probability between 0.21 - 0.31 ?
Slightly different from previous distribution.
GetWordCloud(df[(df.pred_prob >= 0.21) & (df.pred_prob < 0.31) & (df.Score == 4)].seq_len_text.to_frame())
Question 6.22.8: What do the reviews with rating 4 and predicted probability between 0.31 - 0.41 look like ?
Sampling 5 reviews from the said distribution
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Mixed Sentiments, Brief | 0.35 | Satisfactory |
| 2 | 4 | Positive | Positive Sentiments, Slight Incoherence | 0.34 | Not Satisfactory |
| 3 | 4 | Negative | Mixed Sentiments | 0.37 | Satisfactory |
| 4 | 4 | Positive | Mixed Sentiments, Brief | 0.36 | Satisfactory |
| 5 | 4 | Positive | Positive Sentiments, Suggestions | 0.39 | Satisfactory |
index = list(df[(df.pred_prob >= 0.31) & (df.pred_prob < 0.41) & (df.Score == 4)].index)
GetSampleReviewsWithDetails(index,5, col_indices=[-3, -7])
Question 6.22.9: How are words distributed for reviews with rating 4 and predicted probability between 0.31 - 0.41 ?
Not much different from the previous distribution; again, not very useful.
GetWordCloud(df[(df.pred_prob >= 0.31) & (df.pred_prob < 0.41) & (df.Score == 4)].seq_len_text.to_frame())
Question 6.22.10: What do the reviews with rating 4 and predicted probability between 0.41 - 0.50 look like ?
Sampling 5 reviews from the said distribution
| # | Rating | Original Label | My Label | Predicted Probability | Model Performance |
|---|---|---|---|---|---|
| 1 | 4 | Positive | Mixed Sentiments, Positive Inclination | 0.49 | Satisfactory |
| 2 | 4 | Positive | Mixed Sentiments, Suggestions | 0.44 | Satisfactory |
| 3 | 4 | Negative | Positive Sentiments, Suggestion, Brief | 0.45 | Satisfactory |
| 4 | 4 | Positive | Mixed Sentiments, Positive Inclination | 0.43 | Satisfactory |
| 5 | 4 | Positive | Mixed Sentiments | 0.42 | Satisfactory |
index = list(df[(df.pred_prob >= 0.41) & (df.pred_prob < 0.50) & (df.Score == 4)].index)
GetSampleReviewsWithDetails(index,5, col_indices=[-3, -7])
Question 6.22.11: How are words distributed for reviews with rating 4 and predicted probability between 0.41 - 0.50 ?
Very similar to the earlier distributions, hence not very useful.
GetWordCloud(df[(df.pred_prob >= 0.41) & (df.pred_prob <= 0.50) & (df.Score == 4)].seq_len_text.to_frame())
Question 6.23: How satisfied am I with the model performance on the 25 sampled incorrectly predicted 4-star reviews ?
metric = ("Satisfactory " * 23).split()
metric.extend(("Not-Satisfactory " * 2).split())
sns.countplot(metric)
Question 6.24: How are my labels (i.e User Defined / adjudged) distributed for the 25 sampled incorrectly predicted 4-star rating reviews ?
f, axes = plt.subplots(figsize=(24, 6))
labels = ("Mixed-Sentiments " * 15).split() #
labels.extend(("Negative-Sentiments " * 6).split()) #
labels.extend(("Brief " * 5).split()) #
labels.extend(("Suggestion " * 4).split()) #
labels.extend(("Negative-Inclination " * 3).split()) #
labels.extend(("Positive-Sentiments " * 2).split()) #
labels.extend(("Positive-Inclination " * 2).split()) #
labels.extend(("Comparison " *2).split()) #
labels.extend(("Incoherent " * 2).split()) #
labels.extend(("Caution " * 1).split()) #
labels.extend(("Expectation " * 1).split()) #
labels.extend(("Criticism " * 1).split()) #
sns.countplot(labels)
Question 6.25: Can I proceed with the model ?
The analysis of the 25 sampled incorrectly predicted rating-4 reviews suggests that the model is doing well even where it isn't expected to. Yes, I should proceed with the model.
In this section I will test the model by tasking it with previously unseen data from the web.
# Converting a review to network-feedable format
def GetSingleSequence(sentence):
    sentence = sentence.lower()
    # Removing HTML tags
    sentence = re.compile(r'<[^>]+>').sub('', sentence)
    # Removing punctuation
    sentence = re.sub(r'[^\w\s]', '', sentence)
    words = sentence.split()[:seq_len]
    # Out-of-vocabulary words are mapped to the padding index
    sequence = [[vocab[x].index] if x in vocab.keys() else [len(vocab)] for x in words]
    sequence = PadSequence(sequence)
    X = np.array(sequence).reshape(1, seq_len)
    return X
Question 7.1: How does the model perform on a lengthy positive review with overwhelmingly positive sentiments?
Review: These 1-min oats are a life saver. I love that it’s so practical to use. Usually in the morning I just pour in the amount I want and then add just some milk and put in microwave. While the coffee is brewing, it’ll be done at the time. I love that it doesn’t have any additives in it so I can add whatever I want depending on my mood. Some days I’ll add berries and banana, other days I’ll add chocolate chips and nuts or cinnamon and honey would do. I noticed that with a full bowl of these oats, my tummy is satisfied for longer so I cut down on eating unnecessary items in between.
Rating: 5
review = "These 1-min oats are a life saver. I love that it’s so practical to use. Usually in the morning I just pour in the amount I want and then add just some milk and put in microwave. While the coffee is brewing, it’ll be done at the time. I love that it doesn’t have any additives in it so I can add whatever I want depending on my mood. Some days I’ll add berries and banana, other days I’ll add chocolate chips and nuts or cinnamon and honey would do. I noticed that with a full bowl of these oats, my tummy is satisfied for longer so I cut down on eating unnecessary items in between."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Pretty Good !!! ")
Question 7.2: How does the model perform on a short positive review with overwhelmingly positive sentiments?
Review: I CAN'T STOP EATING FLAMIN HOT CHEETOS, they're my weakness, idc the presentation, I love them in every way. They have the perfect spicy flavor.
Rating: 5
review = "I CAN'T STOP EATING FLAMIN HOT CHEETOS, they're my weakness, idc the presentation, I love them in every way. They have the perfect spicy flavor. "
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Awesome !!! ")
Question 7.3: How does the model perform on a medium-length review with mixed sentiments?
Review: "Love Cheerios, especially the frosted kind. Only reason for 4 stars instead of 5 is the sugar content. They don't really contain much nutritional value for a complete breakfast but they sure are tasty."
Rating: 4
review = "Love Cheerios, especially the frosted kind. Only reason for 4 stars instead of 5 is the sugar content. They don't really contain much nutritional value for a complete breakfast but they sure are tasty."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Great !! ")
Question 7.4: How does the model perform on a medium-length review with mixed sentiments and a negative inclination?
Review: I tried these to compare with Frosted Flakes. They were just ok, I still prefer Honey Nut. Bought at Kroger and they did have several Cheerio choices. Will keep buying the honey nut and would suggest others to the same!
Rating: 3
review = "I tried these to compare with Frosted Flakes. They were just ok, I still prefer Honey Nut. Bought at Kroger and they did have several Cheerio choices. Will keep buying the honey nut and would suggest others to the same!"
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Good !! ")
Question 7.5: How does the model perform on a lengthy review with overwhelmingly negative sentiment?
Review: These are wayyy too sugary for me. With all the added sugars, I don't really think these can even still be considered healthy. But since it's under the Cheerios name, many will just associate this with a healthy morning option. Definitely not a great start to your day, unless you were planning on working a sugar crash into your busy schedule. I never actually buy this cereal directly, I always end up getting it when I purchase those mini cereal variety boxes, and then I eat it at 2 a.m. when I am truly desperate for a snack. These are wayyy too sugary for me. With all the added sugars, I don't really think these can even still be considered healthy. But since it's under the Cheerios name, many will just associate this with a healthy morning option. Definitely not a great start to your day, unless you were planning on working a sugar crash into your busy schedule. I never actually buy this cereal directly, I always end up getting it when I purchase those mini cereal variety boxes, and then I eat it at 2 a.m. when I am truly desperate for a snack.
Rating: 2
review = "These are wayyy too sugary for me. With all the added sugars, I don't really think these can even still be considered healthy. But since it's under the Cheerios name, many will just associate this with a healthy morning option. Definitely not a great start to your day, unless you were planning on working a sugar crash into your busy schedule. I never actually buy this cereal directly, I always end up getting it when I purchase those mini cereal variety boxes, and then I eat it at 2 a.m. when I am truly desperate for a snack. These are wayyy too sugary for me. With all the added sugars, I don't really think these can even still be considered healthy. But since it's under the Cheerios name, many will just associate this with a healthy morning option. Definitely not a great start to your day, unless you were planning on working a sugar crash into your busy schedule. I never actually buy this cereal directly, I always end up getting it when I purchase those mini cereal variety boxes, and then I eat it at 2 a.m. when I am truly desperate for a snack."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Pretty Good !!! ")
Question 7.6: How does the model perform on a short negative review with overwhelmingly negative sentiment?
Review: "I don't think these are good at all. Someone recommend these to me to try. I will not buy these ever.."
Rating: 1
review = "I don't think these are good at all. Someone recommend these to me to try. I will not buy these ever.."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Excellent !!! ")
In this section I am exploring the possibility of using learned weights from the model to predict sentiments in different problem domains.
Question 8.1.1: How does the model perform on a positive review?
Review: I love to treat myself for lunch here! Sandwiches are 5.99, drinks 1, combo 9.99 and that includes two sides and a drink (pop or juice). It's also delicious and the staff are kind. I almost always get a combo with potatoes, salad and the falafel wrap. It's very filling! The potatoes are roasted and seasoned wonderfully and they can put a creamy garlic sauce over them. The salad is spiced with vinegar and roasted thyme. YUM! The falafel is perfect! Crunchy outside, soft inside, seasoned, it's great. The rice is the only thing I'm not excited about. It doesn't have as strong a flavor as I would like but if you like a blander taste you will probably enjoy it. Overall, very affordable and at a great cost. I could not recommend this place more.
Rating: 5
Source - Read Fushcia H.'s review of Al-Madina Market & Grill on Yelp
review = "I love to treat myself for lunch here! Sandwiches are 5.99, drinks 1, combo 9.99 and that includes two sides and a drink (pop or juice). It's also delicious and the staff are kind. I almost always get a combo with potatoes, salad and the falafel wrap. It's very filling! The potatoes are roasted and seasoned wonderfully and they can put a creamy garlic sauce over them. The salad is spiced with vinegar and roasted thyme. YUM! The falafel is perfect! Crunchy outside, soft inside, seasoned, it's great. The rice is the only thing I'm not excited about. It doesn't have as strong a flavor as I would like but if you like a blander taste you will probably enjoy it. Overall, very affordable and at a great cost. I could not recommend this place more."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Awesome like Al-Madina!!! ")
Question 8.1.2: How does the model perform on a negative review?
Review: Extremely poor service. They have no servers. They take an order and don't even bring it. So you are left with no food on the table.
We were there today and ordered for bhindi masala and hariyali kofta and Naan. The haryali kofta came in with naan, We had to remind the server twice to bring in the rice.Then we were waiting for the bhindi masala to be brought to the table. In the mean while the other orders were being brought out and no sign of our second entree. We kept reminding them about the order and all we got was yes it is getting ready. By the third reminder I was so upset and told him he should cancel the bhindi masala order as we were almost done eating. The nerve he had to tell us "oh we are just bringing it out". I was really upset about it, we had waited for almost 30 mins for the bhindi masala.
We just finished our one entree and left, there was no sorry nothing from the person taking the bills. That's pretty rude.
Don't know how long this restaurant can function with this kind of service.
Would like to put Zero stars for this like of service.
Rating: 1
Source - Read Avisha G.'s review of Ravis Hyderabad House on Yelp
review = "Extremely poor service. They have no servers. They take an order and don't even bring it. So you are left with no food on the table. We were there today and ordered for bhindi masala and hariyali kofta and Naan. The haryali kofta came in with naan, We had to remind the server twice to bring in the rice.Then we were waiting for the bhindi masala to be brought to the table. In the mean while the other orders were being brought out and no sign of our second entree. We kept reminding them about the order and all we got was yes it is getting ready. By the third reminder I was so upset and told him he should cancel the bhindi masala order as we were almost done eating. The nerve he had to tell us oh we are just bringing it out. I was really upset about it, we had waited for almost 30 mins for the bhindi masala. We just finished our one entree and left, there was no sorry nothing from the person taking the bills. That's pretty rude. Don't know how long this restaurant can function with this kind of service. Would like to put Zero stars for this like of service."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Decent ! ")
Question 8.2.1: How does the model perform on a positive review?
Review: A glorious train wreck that will not let you look away! You should be inside already, sit back and enjoy the show!
Rating: 5
Source - Review by Tom A at Rotten Tomatoes
review = "A glorious train wreck that will not let you look away! You should be inside already, sit back and enjoy the show!"
print(f"{model.predict(GetSingleSequence(review))[0][0]} - 5 stars for the prediction. Roar!!! ")
Question 8.2.2: How does the model perform on a negative review?
Review: Season 8 of Game of Thrones had some of the laziest writing I have ever seen. The writers abandoned every storyline, character storylines/arcs, and reverse engineered everything to get their mad queen narrative. Season 8 quite simply, was an insult to viewers and GOT fans everywhere.
Rating: 0.5
Source - Review by Jess C at Rotten Tomatoes
review = "Season 8 of Game of Thrones had some of the laziest writing I have ever seen. The writers abandoned every storyline, character storylines/arcs, and reverse engineered everything to get their mad queen narrative. Season 8 quite simply, was an insult to viewers and GOT fans everywhere."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Average, just like GOT Season 8 ! ")
Question 8.3.1: How does the model perform on a positive tweet?
Tweet:
We hope you have a great weekend.
We hope you stay at home.
We hope you stay away from gatherings and practice physical distancing.
We hope you decide to protect others, protect our community.
We hope you take #COVID19 seriously.
#InThisTogether
Source - Twitter
review = "We hope you have a great weekend. We hope you stay at home. We hope you stay away from gatherings and practice physical distancing. We hope you decide to protect others, protect our community. We hope you take #COVID19 seriously. #InThisTogether"
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Positive!! ")
Question 8.3.2: How does the model perform on a negative tweet?
Tweet:
@airtelindia
@Airtel_Presence Call back from network team is complete lie. They give a single ring missed call and say they could reach us.
Network provider cannot reach customer on their own network lol. Irony
Source - Twitter
review = "Call back from network team is complete lie. They give a single ring missed call and say they could reach us. Network provider cannot reach customer on their own network lol. Irony"
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Spot On !!! ")
Question 8.4.1: How does the model perform on a positive comment?
Comment:
He’s hugely popular in the squad and his outlook and fun-loving nature generates a sense of collective well-being and togetherness. I know social media isn’t a true barometer of anything, but look at the way he interacts with other players on Instagram and how they respond. It’s not just his so-called mates either, it’s senior players and young players, right throughout the squad.
Source: Arseblog
review = "He’s hugely popular in the squad and his outlook and fun-loving nature generates a sense of collective well-being and togetherness. I know social media isn’t a true barometer of anything, but look at the way he interacts with other players on Instagram and how they respond. It’s not just his so-called mates either, it’s senior players and young players, right throughout the squad."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Decent! ")
Question 8.4.2: How does the model perform on a negative comment?
Comment:
The issues were endless. Fractious relationships with players; Ozil and Ramsey left out then brought back; a lack of genuine authority despite trying to be an authoritarian when he first took over; no defined style of play; our captain basically destroying his own legacy to get away from the club as quickly as possible this summer (more alarm bells); the indecision over simple things like who should be captain; the Xhaka situation for which the player deserves criticism for his reaction, but under which Emery had lit a fuse that never needed to be lit with his handling of the captaincy; poor communication, and clumsy attempts to connect with fans who had long lost faith; there was just so much in 18 months that it had to come to head.
Source: Arseblog
review = "The issues were endless. Fractious relationships with players; Ozil and Ramsey left out then brought back; a lack of genuine authority despite trying to be an authoritarian when he first took over; no defined style of play; our captain basically destroying his own legacy to get away from the club as quickly as possible this summer (more alarm bells); the indecision over simple things like who should be captain; the Xhaka situation for which the player deserves criticism for his reaction, but under which Emery had lit a fuse that never needed to be lit with his handling of the captaincy; poor communication, and clumsy attempts to connect with fans who had long lost faith; there was just so much in 18 months that it had to come to head."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Great !! ")
8.5.1: Criticising the critics
Review:
This movie should not be rated by any "critic" this show has a following that spans millions and millions of fans. As a true fan of this show and these guys, I rate it a popping 5 Stars.. The movie was excellent all around. It's just sad to see some guy who probably sat and watched this movie with Oscar Award goggles on. Hahahahahahahahaha... now that was funny.
My Label: Positive Review with criticism for critics
Rating: 5
Model Performance: Although the overall sentiment of the statement is objectively mixed, the sentiment with respect to the movie is clearly positive; the model does not take this context-specific positivity into account and instead takes a generalized approach to the sentiment of the statement.
review = "This movie should not be rated by any critic this show has a following that spans millions and millions of fans. As a true fan of this show and these guys, I rate it a popping 5 Stars.. The movie was excellent all around. It's just sad to see some guy who probably sat and watched this movie with Oscar Award goggles on. Hahahahahahahahaha... now that was funny."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Confused !! ")
8.5.2: Sarcasm
Review:
God Himself could not record as good a greatest hits album like this, and if He were to listen to all 17 tracks on this compilation, He would refrain from striking me down for blasphemy. The song “Hot Shot City” is particularly good.
My Label: Negative
Model Performance: It would be too much to expect the model to understand the sarcasm, especially when the text is not accompanied by pauses and tonal modulation. Maybe in a parallel universe where everyone reviews like Chandler Bing, this model would still likely give the same results ;).
review = "God Himself could not record as good a greatest hits album like this, and if He were to listen to all 17 tracks on this compilation, He would refrain from striking me down for blasphemy. The song “Hot Shot City” is particularly good."
print(f"{model.predict(GetSingleSequence(review))[0][0]} - Bazinga !!! ")
From the analysis, it is quite evident that the model developed is well-equipped to appropriately classify reviews based on their sentiments.
Initially, I set up the dataset in my Google Drive, as Colab doesn't hold datasets in the environment for long, and each time the runtime got disconnected (which is quite often) I would otherwise have needed to re-upload the dataset. I then performed EDA on it to check whether there was anything statistically deceptive in the dataset.
I then cleaned the data of punctuation and HTML tags, and converted the task into a binary classification problem: reviews with rating (or score) 1, 2, or 3 were labeled as negative sentiment (0), and reviews with rating 4 or 5 were labeled as positive sentiment (1).
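The score-to-label mapping described above can be sketched as follows (a minimal illustration; the actual helper in the notebook may be named differently):

```python
# Map 1-5 star scores to binary sentiment labels:
# scores 1-3 -> negative (0), scores 4-5 -> positive (1).
def score_to_label(score):
    return 1 if score >= 4 else 0

labels = [score_to_label(s) for s in [1, 2, 3, 4, 5]]
# labels == [0, 0, 0, 1, 1]
```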
I then fed the data to the Word2Vec method from the Gensim library. Gensim's Word2Vec model returned a word-vector object containing the contextualized embedding matrix, the vocabulary, the index-to-word mapping, etc. I then used the vocabulary to set up the training data and labels in a network-feedable format.
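The conversion to a network-feedable format can be sketched roughly as below. The `vocab` dict here is a hypothetical stand-in for the word-to-index lookup that Gensim's word-vector object provides, and right-padding with the out-of-vocabulary index is an assumption, not necessarily what `PadSequence` does:

```python
# Hypothetical vocabulary standing in for Gensim's word -> index lookup;
# out-of-vocabulary words map to len(vocab), as in GetSingleSequence.
vocab = {"love": 0, "this": 1, "product": 2}
OOV = len(vocab)
SEQ_LEN = 5

def to_padded_sequence(words, seq_len=SEQ_LEN):
    idxs = [vocab.get(w, OOV) for w in words[:seq_len]]
    # Pad on the right up to seq_len (padding index is an assumption).
    return idxs + [OOV] * (seq_len - len(idxs))

seq = to_padded_sequence("love this product".split())
# seq == [0, 1, 2, 3, 3]
```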
In the next section, I iteratively developed a Recurrent Neural Network model using the super-easy Keras wrapper of the TensorFlow framework. I also assigned appropriate weights to each class to account for the massive class imbalance. After finalizing the model, I analyzed its performance. Overall the model seemed to be doing pretty well (~90% train and validation accuracy), but careful analysis made it evident that the model does well on all types of reviews with the exception of reviews with rating 4.
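The class weighting can be sketched with a standard inverse-frequency formula; the exact weights used for training are not shown here, so treat the numbers below as an assumption:

```python
from collections import Counter

# Toy labels where positives dominate, mimicking the class imbalance.
labels = [1, 1, 1, 1, 0]
counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency ("balanced") weights: n / (k * count_c) for class c.
class_weight = {c: n / (k * counts[c]) for c in counts}
# class_weight == {1: 0.625, 0: 2.5}
# These would be passed to Keras via model.fit(..., class_weight=class_weight).
```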
Reviews with rating 4 were found to have a disproportionately high number of reviews incorrectly predicted as negative, but a detailed analysis of samples drawn from different parts of the predicted-probability distribution revealed that although most of these reviews had rating 4, the overwhelming sentiment in them was either mixed or negative; hence the incorrect prediction in terms of rating. With respect to the sentiment of these reviews, however, the "incorrect" prediction actually made more sense.
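The slice-and-inspect step described here can be sketched as follows (the data and variable names are hypothetical):

```python
# Find rating-4 reviews whose predicted positive-class probability fell
# below 0.5, i.e. the ones the model pushed toward "negative".
ratings = [4, 4, 5, 4, 1]
probs = [0.20, 0.90, 0.95, 0.40, 0.10]

suspect = [i for i, (r, p) in enumerate(zip(ratings, probs))
           if r == 4 and p < 0.5]
# suspect == [0, 3] -> indices of reviews worth reading manually
```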
One explanation for why reviews with rating 4 often had negative or mixed sentiment is that, generally, when people review something they take one of two approaches: the bottom-up approach (0 to 5), where the reviewer starts from zero and keeps adding points incrementally, and the opposite top-down approach (5 to 0), where one assigns full points to a product and then deducts points with each failure found in a sequential assessment of the product. The fact that the network could learn the second type of review without being explicitly told (i.e., no labels of that nature were provided) shows the immense power of LSTM (Long Short-Term Memory) cells and the RNN architecture, and it also reminded me of a popular blog post by Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks.
Further, I validated the model with previously unseen data from the web, and the model performed as expected. I then tried to predict sentiments on statements and reviews from other domains by pooling data from different sources, to see whether the learned weights could be used for other tasks as well, and the results were promising.
Although I cannot claim that the learned weights alone suffice to predict sentiments in different domains, one can certainly transfer the learning from one task to another, given that the tasks are of a similar nature; on top of the learned weights, one can add a small domain-specific network adept at classifying sentiments for that domain, thereby greatly reducing training time.
There can be many ways to apply this model, but one application comes to mind right away: a company selling food products could set up web crawlers that collect mentions of its products from various social media platforms (like Twitter, Facebook, LinkedIn, etc.); the model would then automatically classify the sentiment of each statement, which could be used to gauge how the product is being received on a daily, weekly, or monthly basis, and, based on the results, further analysis could be commissioned for various sub-groups.
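As a rough sketch of the aggregation step such a pipeline would need (the dates and scores below are made up):

```python
from collections import defaultdict
from statistics import mean

# Predicted positive-sentiment probabilities tagged with the day the
# statement was crawled (hypothetical data).
preds = [("2020-05-01", 0.9), ("2020-05-01", 0.2), ("2020-05-02", 0.8)]

by_day = defaultdict(list)
for day, p in preds:
    by_day[day].append(p)

# Daily product-reception score: mean predicted sentiment per day.
daily = {day: mean(ps) for day, ps in by_day.items()}
```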
As far as the classifier is concerned, the scope for improvement is limited; any significant improvement would require a lengthy process of iteration. However, it would be very useful to deploy this model, have it predict sentiments on newly generated data, and get a sampled and customized report on a daily basis to check whether the model is indeed performing as hypothesized by this analysis.